I can has speech? What data WhisperSpeech needs?

WhisperSpeech is trained on heavily preprocessed speech data generated from several models:

WhisperSpeech TTS overview diagram

Who is who? A high-level overview

To get these three data representations we have to run the audio data through several models. The first two steps are always the same; the rest depend on which model we want to train.

  1. We start by downloading the speech audio files into a sharded webdataset (e.g. A3. Download Project Guttenberg audiobooks).
    We released webdatasetified versions of two important public domain speech datasets – LibriLight and Project Gutenberg Audiobooks.

  2. All subsequent steps rely on voice activity detection (VAD) and diarization so we always generate segment lists and extract speaker embeddings for all audio files (see 1B. Voice activity detection and 2A. Speaker Embeddings for source code).
    The results of this step were also released on Hugging Face – LibriLight and Project Gutenberg Audiobooks.

The next steps depend on which model we want to train or fine-tune.

  1. To re-train the quantized Whisper model we need to transcribe the audio with Whisper base.en (2A. Whisper quantization dataset preparation). A model pretrained on 60k hours of LibriLight is available from Hugging Face whisper-vq-stoks-v2.model.
  2. To train the text to semantic token model we need to transcribe the audio with Whisper small.en and extract the semantic tokens (5A. T2S dataset preparation).
  3. To train the semantic to acoustic model (S2A) we need to extract the semantic tokens and compress the audio with Encodec (4A. S2A dataset preparation).

These three steps are all independent since they require different chunking of speech data. For quantizing Whisper and S2A training we greedily merge the VAD segments from the same speaker into (at most) 30 second chunks to improve training performance (more uniform chunks mean less computation time is spent on padding). For T2S we randomly truncate when merging the VAD segments so the model also learns how to work with shorter texts. The code to perform this is in 1C. VAD merging.
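
To make the chunking concrete, here is a simplified Python sketch of the greedy merging into (at most) 30-second chunks. It is illustrative only – the real logic lives in 1C. VAD merging:

# Simplified sketch of greedy VAD-segment merging (not the exact 1C. VAD merging code).
# `segments` is a list of (start, end, speaker) tuples produced by VAD + diarization.
def merge_segments(segments, max_len=30.0):
    chunks, cur = [], None
    for start, end, speaker in segments:
        # extend the current chunk while the speaker stays the same and we fit in max_len
        if cur is not None and speaker == cur["speaker"] and end - cur["start"] <= max_len:
            cur["end"] = end
        else:
            if cur is not None:
                chunks.append(cur)
            cur = dict(start=start, end=end, speaker=speaker)
    if cur is not None:
        chunks.append(cur)
    return chunks

# two chunks for speaker A (30-second limit) and one for speaker B;
# for T2S the merging additionally truncates chunks at random lengths
print(merge_segments([(0, 12, "A"), (13, 25, "A"), (26, 40, "A"), (41, 50, "B")]))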

TL;DR example – give me the codes!

In this example we will convert a single split from the Multilingual LibriSpeech (MLS) dataset.

Prepare the webdataset shards

The first and most time-consuming step is to convert the data from its original form into the webdataset format. If you want to skip this section and still follow along, the results can be downloaded from Hugging Face at datasets/collabora/multilingual-librispeech-webdataset.

First we need tarp, a tool that helps create and manipulate webdataset tar files more effectively. You can read more about it in the official tarp README.

go install -v github.com/collabora/tarp/tarp@latest

Afterwards, we download and unpack the original dataset files:

aria2c -x10 https://dl.fbaipublicfiles.com/mls/mls_french_opus.tar.gz
tar -xf mls_french_opus.tar.gz

Next, we’ll need to convert each line in the transcripts.txt file:

10065_10039_000000      ses vêtements devinrent tout brillants de lumière et blancs comme la neige en sorte qu'il n'y a point de foulon sur la terre qui puisse en faire d'aussi blancs

into a tarp script:

train/10065_10039_000000.opus file:mls_french_opus/train/audio/10065/10039/10065_10039_000000.opus
train/10065_10039_000000.txt text:ses vêtements devinrent tout brillants de lumière et blancs comme la neige en sorte qu'il n'y a point de foulon sur la terre qui puisse en faire d'aussi blancs

We can achieve this using a short Python script (saved as make-script.py):

import sys

# usage: python3 make-script.py mls_french_opus/train/transcripts.txt
fname = sys.argv[1]
dir, split, _ = fname.rsplit("/", 2)  # e.g. "mls_french_opus", "train", "transcripts.txt"

for ln in open(fname):
    id, txt = ln.split("\t")          # each line is "<sample id>\t<transcript>"
    a, b, c = id.split("_")           # the id encodes the audio path: audio/<a>/<b>/<id>.opus
    txt = txt.replace("\n", "")
    # emit two tarp lines per sample: the audio file and its transcript
    print(f"""{split}/{id}.opus file:{dir}/{split}/audio/{a}/{b}/{id}.opus
{split}/{id}.txt text:{txt}""")

Once we have this, we can run the conversion process. The Python script outputs data sample descriptions which are fed to tarp create, which archives them into a tar stream (a bit similar to tar -T -). tarp split then cuts the incoming stream into 2GB shards and saves them to separate files, making sure to split on sample boundaries.

The 2GB size was chosen as a good compromise between the shard count and shard transcription time for mp3/opus files with multilingual speech. For LibriLight (English speech compressed with FLAC) the magic number was 5GB because FLAC compresses less and we can also use a smaller model for transcribing English speech.

python3 make-script.py  mls_french_opus/train/transcripts.txt \
  | /root/go/bin/tarp create -o - - \
  | /root/go/bin/tarp split -s 2e9 -o 'mls_french_train-audio-%06d.tar' -

We’ll have to repeat the same command two more times, replacing train with test and dev, and afterwards we can upload everything to Hugging Face:

huggingface-cli login
huggingface-cli upload --repo-type dataset collabora/multilingual-librispeech-webdataset .

Process the shards on a single GPU machine

We do the sharding mainly to be able to process data effectively on many GPUs, but for the sake of simplicity we will use a single GPU here. The process stays the same; only different tools would be used to schedule the jobs. For reference, the approximate runtime on an RTX 4090 for the French subset of MLS is given below each command.

Perform voice activity detection:

parallel --eta -j3 python -m whisperspeech.vad {} ::: ./*.tar
# 50min

Extract speaker embeddings for each fragment:

parallel --eta -j2 python -m whisperspeech.extract_spk_emb --batch_size 16 {} ::: ./*.tar
# 1h 10min

Next we perform VAD segment merging (we do it as a separate step here to remove all randomness and make the later steps reproducible):

parallel --eta -j16 python -m whisperspeech.vad_merge --eqvad {} ::: *.tar
parallel --eta -j16 python -m whisperspeech.vad_merge {} ::: *.tar

With that covered we can start the heavy lifting – transcription:

parallel --eta -j1 python -m whisperspeech.prepare_t2s_txts --transcription_model medium --language fr --batch_size 32 {} ::: *.tar
# 6h 48min

Afterwards comes Encodec compression:

parallel --eta -j2 python -m whisperspeech.prepare_s2a_atoks --batch_size 4 {} ::: *.tar
# 2h

Now we can extract the semantic tokens for both the T2S (eqvad) and S2A (maxvad) training:

parallel --eta -j1 python -m whisperspeech.extract_stoks --batch_size 16 --vq_model ../nbs/vqmodel-medium-en+pl-512c-dim64.model {} ::: *.tar
parallel --eta -j1 python -m whisperspeech.extract_stoks --kind eqvad --batch_size 16 --vq_model ../nbs/vqmodel-medium-en+pl-512c-dim64.model {} ::: *.tar
# 3h 45min

Splitting out the validation set(s)

After we have all the samples we may want to extract some validation sets. There are many ways to do it but here we’ll manually choose some speakers we’ll later skip completely during training.

We start by dumping all the sample ids:

parallel tar tf {} ::: stoks/*-atoks-3kbps-*.tar.gz | sed -e 's/\.atoks\.npy//' > all-samples-maxvad
parallel tar tf {} ::: stoks/*-medium-txt-*.tar.gz | sed -e 's/\.txt//' > all-samples-eqvad
wc -l all-samples-maxvad

Because the sample ids (which are the original file paths) have speaker ids in them we can make a quick histogram:

< all-samples-maxvad awk -F_ '{ print $1; }'|sort|uniq -c|sort -n|less

From the result we can copy and paste 10 speaker ids with around 50 samples each to get 512 validation samples. We’ll exclude them from the training set because we want to validate on unseen speakers. We have to repeat this process for both splits (maxvad and eqvad) since they’ll have different sample counts and ids:

< all-samples-maxvad grep 'train/1579\|train/2033\|train/3182\|train/12981\|train/2284\|train/2297\|train/6348\|train/7200\|train/7679\|train/1989' > unseen-speakers-maxvad
< all-samples-eqvad grep 'train/1579\|train/2033\|train/3182\|train/12981\|train/2284\|train/2297\|train/6348\|train/7200\|train/7679\|train/1989' > unseen-speakers-eqvad
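
If you prefer to script this instead of copy-pasting, a minimal sketch (a hypothetical helper, not part of the repository) could look like the following; it picks speakers with roughly 50 samples each until it has collected about 512 validation samples:

from collections import Counter

# count samples per speaker; sample ids look like "train/10065_10039_000000"
counts = Counter(line.split("_", 1)[0] for line in open("all-samples-maxvad"))

picked, total = set(), 0
for speaker, n in sorted(counts.items(), key=lambda kv: kv[1]):
    if not 40 <= n <= 60:      # aim for speakers with roughly 50 samples each
        continue
    picked.add(speaker)
    total += n
    if total >= 512:
        break

# write out the validation sample ids (repeat with the eqvad list as well)
with open("all-samples-maxvad") as f, open("unseen-speakers-maxvad", "w") as out:
    out.writelines(line for line in f if line.split("_", 1)[0] in picked)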

Once we have all the ids we can rescan the whole dataset and split out the validation samples into separate webdataset shards to make validation fast:

python -m whisperspeech.split_out_val_datasets *-atoks-* unseen-speakers-maxvad
python -m whisperspeech.split_out_val_datasets '*-txt-*' unseen-speakers-eqvad
(cd stoks && python -m whisperspeech.split_out_val_datasets '*-maxvad-stoks-*' ../unseen-speakers-maxvad)
(cd stoks && python -m whisperspeech.split_out_val_datasets '*-eqvad-stoks-*' ../unseen-speakers-eqvad)

We can use wc -l on the sample lists (as we did above for all-samples-maxvad) to find out how many samples we have – these counts will come in handy for the dataset configuration files below.

Creating the dataset configuration files for training

Finally we create the configuration files for the training script:

cat > mls-fr-t2s-train.dataset <<EOF
multilingual-librispeech-webdataset/*-medium-txt-*.tar.gz multilingual-librispeech-webdataset/vq-en+pl/ 390203 --txt_kind='medium-txt' --language=fr --exclude_files multilingual-librispeech-webdataset/unseen-speakers-eqvad
EOF
cat > mls-fr-s2a-train.dataset <<EOF
multilingual-librispeech-webdataset/*-atoks-*.tar.gz multilingual-librispeech-webdataset/vq-en+pl/ 338362  --language=fr --exclude_files multilingual-librispeech-webdataset/unseen-speakers-maxvad
EOF
cat > mls-fr-s2a-val-unseen-speakers.dataset <<EOF
multilingual-librispeech-webdataset/unseen-speakers-maxvad.tar.gz multilingual-librispeech-webdataset/vq-en+pl/ 512 --language fr
EOF
cat > mls-fr-t2s-val-unseen-speakers.dataset <<EOF
multilingual-librispeech-webdataset/unseen-speakers-eqvad.tar.gz multilingual-librispeech-webdataset/vq-en+pl/ 512 --txt_kind 'medium-txt' --language fr
EOF

Why WebDataset?

All WhisperSpeech training and preprocessing code has been reorganized around webdatasets. Webdatasets are just simple tar files that store all our data samples (files), which makes them great for working with very large datasets. Inside these tar files we can store multiple files per sample in any format we want (e.g. the speech mp3/flac/wav files, the text transcripts, tokens in numpy arrays). For example, from the data used to train the S2A model we have:

$ tar tf whisperspeech-s2a-512c-dim64/librilight-small-000.tar.gz |head -6
small/1874/shortlifelincoln_0809_librivox_64kb_mp3/shortlifeoflincoln_10_nicolay_64kb_021.atoks.npy
small/1874/shortlifelincoln_0809_librivox_64kb_mp3/shortlifeoflincoln_10_nicolay_64kb_021.stoks.npy
small/28/amateur_cracksman_librivox_64kb_mp3/amateur_cracksman_04_hornung_64kb_004.atoks.npy
small/28/amateur_cracksman_librivox_64kb_mp3/amateur_cracksman_04_hornung_64kb_004.stoks.npy
small/1874/shortlifelincoln_0809_librivox_64kb_mp3/shortlifeoflincoln_10_nicolay_64kb_052.atoks.npy
small/1874/shortlifelincoln_0809_librivox_64kb_mp3/shortlifeoflincoln_10_nicolay_64kb_052.stoks.npy

Each file keeps the name of the original dataset sample, and the extensions tell us what kind of value it holds and in which format.

Furthermore we can split the whole dataset into fixed-size tar files called shards and load them on demand without unpacking. It turns out that this is exactly what we need for both AI training and data preprocessing:

  • for training we start multiple CPU workers in parallel, open a different shard in each, stream the data sequentially from disk (fast), decode it independently and then shuffle the samples we receive from each worker to create varied training batches (a short sketch follows below)
  • for preprocessing we independently send each shard to a worker and save all the results in a new webdataset shard

Reading samples sequentially allows us to simply compress the whole file with gzip and offers the best performance even on spinning or network disks.
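
To make this concrete, here is a minimal sketch of how a worker could stream S2A samples from such shards with the webdataset library (the shard names are illustrative, and the actual WhisperSpeech dataloaders add more steps):

import webdataset as wds

# each sample groups all files sharing a base name; .decode() turns *.npy payloads
# into numpy arrays
ds = (
    wds.WebDataset("librilight-small-{000..009}.tar.gz", shardshuffle=True)
    .shuffle(1000)                        # small in-memory shuffle buffer per worker
    .decode()
    .to_tuple("stoks.npy", "atoks.npy")   # semantic and acoustic tokens per sample
)

loader = wds.WebLoader(ds, batch_size=None, num_workers=4)
for stoks, atoks in loader:
    break  # feed into the training loop here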

Note

For the Juwels cluster there is another crucial benefit. There is a pretty low limit on the total number of files on network disks (inodes to be precise) so there is a strong preference to keep data in a few large files. The network file system performance is also better if we don’t have to open too many files.

Keeping each shard around 5GB seems to work great (the processed shards will likely be a lot smaller but it’s a lot easier to keep a 1-to-1 shard mapping). For the almost 4TB LibriLight dataset this translates to 625 files.

We found it quite useful to also keep the data divided into a few splits. This is data dependent, but for LibriLight we followed the original splits (small, medium, large) and additionally extracted speaker 6454 from the large split because it is the largest single-speaker subset, which allowed us to use it during development without downloading the full 4TB.

Caution

The sample file names should not have dots in them, otherwise the WebDataset code gets confused about which files go together into one sample. This can be worked around later but it’s easiest if we just do .replace('.', '_') when storing the initial raw dataset.
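
For example, when building the initial raw shards with wds.TarWriter, the key can be sanitized like this (a sketch with made-up sample data):

import webdataset as wds

# one (audio path, transcript) pair with a problematic dot in the base name
samples = [("train/audio/10065_10039_000000.v2.opus", "ses vêtements devinrent ...")]

with wds.TarWriter("dataset-raw-000000.tar") as sink:
    for path, text in samples:
        base = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]  # drop the .opus extension
        sink.write({
            "__key__": "train/" + base.replace(".", "_"),  # no dots left in the sample id
            "opus": b"<opus bytes>",  # placeholder – real code would read the audio file here
            "txt": text,
        })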

Joins on WebDatasets

One novel functionality we developed for this project is the capability to join multiple preprocessed webdatasets. This mechanism relies on keeping a constant ordering of samples in a shard and ensuring 1-to-1 correspondence between the input and output shards during preprocessing.

Example usage:

ds = wds.WebDataset([str(x) for x in Path('librilight/').glob('*.tar')]).compose( # load all audio shards
    wds.decode(wds.torch_audio), # decode the audio data
    vq_stoks.merge_in( # merge another WebDataset
        # for each audio (`raw`) shard, find the path and name of a corresponding `vad` shard
        vq_stoks.derived_dataset('librilight-processed/', 'vad')
    ),
)

derived_dataset creates for us a helper function that returns an opened derived dataset given the original shard file name:

import webdataset as wds
from pathlib import Path

def derived_dataset(path, kind):
    def deriver(url):
        # swap "raw" in the shard file name for e.g. "vad" and look it up in `path`
        url = str(Path(path)/(Path(url).name.replace("raw", kind) + ".gz"))
        return wds.WebDataset(wds.SimpleShardList([url])).decode()
    return deriver

This feature is experimental and the API may change as we develop more experience with this merging style.

Examples of preprocessing runs

An example of running a preprocessing step locally on a single file:

mkdir -p guttenberg-preproc && cd guttenberg-preproc
python -m whisperspeech.vad ../guttenberg-audiobooks/guttenberg-audiobooks-raw-000010.tar

This will generate a file named guttenberg-audiobooks-vad-000000.tar.gz in the guttenberg-preproc directory.

On the cluster we can run multiple jobs in parallel (24 in this case), each processing one input shard. Since each job is pretty short (around 30 minutes) it’s easier for the scheduler to squeeze these between longer and higher-priority jobs.

mkdir -p whisperspeech-s2a-512c-dim64 && cd whisperspeech-s2a-512c-dim64
find ../librilight/ -name 'librilight-small-*.tar'| ~/clapa1/run-batch 24 \
    'python -m whisperspeech.prepare_s2a_dataset $FILE ../librilight-preproc
            --vq_model ~/clapa1/scratch/vqmodel-512c-dim64-4e-hyptuned-32gpu.model
            --batch_size 8'

The prepare_s2a_dataset script takes the raw audio data from the input file, automatically finds the corresponding shards with VAD results in ../librilight-preproc and writes the results to the whisperspeech-s2a-512c-dim64 directory.

Voice activity detection

Code: 1B. Voice activity detection

Right now we are using the VAD model from WhisperX, which is enough to avoid cutting audio in the middle of a word – something that would hurt automated transcriptions quite a lot. For fancier datasets with multiple speakers we could use pyannote for its ability to detect multiple people speaking at once and its diarization capability.
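
WhisperSpeech uses the WhisperX VAD model here, but to illustrate what this step produces, the following sketch uses the Silero VAD instead (a different model, shown only because its API is compact; the file name is a placeholder):

import torch

# Silero VAD stands in for the WhisperX VAD model used by whisperspeech.vad
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("chunk.wav", sampling_rate=16000)
segments = get_speech_timestamps(wav, model, sampling_rate=16000, return_seconds=True)
print(segments)  # e.g. [{'start': 0.3, 'end': 4.1}, ...] – these later get merged into chunks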

We later merge the VAD segments into longer chunks for more efficient training (less padding == higher efficiency). The code and histogram plots can be found in 2A. Whisper quantization dataset preparation.

Transcription

Code: 5A. T2S dataset preparation

For training the TTS model (T2S) we run batches of chunked speech segments through FasterWhisper. We use the small.en model since there seems to be little benefit from using the larger models on English speech. For multilingual TTS we would probably want to switch to large-v2.
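
Roughly, the transcription boils down to something like the sketch below (simplified – whisperspeech.prepare_t2s_txts batches many merged VAD chunks at once, and the file name is a placeholder):

from faster_whisper import WhisperModel

model = WhisperModel("small.en", device="cuda", compute_type="float16")

# "chunk.wav" stands for one merged (eqvad) chunk of speech
segments, info = model.transcribe("chunk.wav", language="en", beam_size=1)
text = " ".join(segment.text.strip() for segment in segments)
print(text)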

Note

Right now we extract both semantic tokens and transcriptions in one go. Doing the transcriptions is very time consuming and the result is unlikely to change. OTOH we may want to regenerate the semantic tokens if we train different quantized Whisper models. Because of that we may want to split this into two separate steps and only merge the results just before we generate the training dataset.

Acoustic token extraction

Code: 4A. S2A dataset preparation

This is basically the same as T2S above but with Encodec instead of Whisper.
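
A rough sketch of what the Encodec compression produces, assuming the 24 kHz Encodec model at 3 kbps (matching the atoks-3kbps shard names); the real code is in whisperspeech.prepare_s2a_atoks and the file name is a placeholder:

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(3.0)                       # 3 kbps -> 4 codebooks per frame

wav, sr = torchaudio.load("chunk.wav")                # one merged (maxvad) chunk
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)                        # list of (codes, scale) tuples
atoks = torch.cat([codes for codes, _ in frames], dim=-1)
print(atoks.shape)                                    # [1, n_q, T] acoustic token ids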

Train/validation split

We create validation splits differently for each dataset. For example, for LibriLight we use the speaker labels to create common and unseen speaker splits. Once we have a list of samples we want to use, we extract them from the full dataset into a new shard while keeping a list of IDs to skip during training. This way we avoid copying the training samples.

This has the downside of delaying all shuffling until training. This is especially problematic for smaller datasets with not enough shards, since multiple workers may read the same shard and initially (before the shuffling buffer is filled) deliver the same samples multiple times. This causes overfitting. It is not a problem early in training (the model is too random to overfit) and we make sure we don’t reset the dataloaders between epochs, but it does cause issues when resuming training from a checkpoint. The workaround is to preload the shuffling buffer with a lot of samples (.shuffle(initial=20000)). Unfortunately this puts a lot of load on the filesystem and adds a significant delay before training can start.
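
As an illustration, a training loader that honors the skip list and preloads the shuffle buffer might be set up like this (the shard names are illustrative, not the exact WhisperSpeech trainer code):

import webdataset as wds

# sample ids of the held-out validation speakers (one per line, same format as __key__)
excluded = set(line.strip() for line in open("unseen-speakers-maxvad"))

ds = (
    wds.WebDataset("mls-fr-atoks-{000000..000019}.tar.gz", shardshuffle=True)
    .select(lambda sample: sample["__key__"] not in excluded)  # skip validation speakers
    .shuffle(20000, initial=20000)  # large initial buffer to avoid repeats after resuming
    .decode()
)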